We explore joint training strategies of DNNs for simultaneous dereverberation and acoustic modeling to improve the performance of distant speech recognition. There are two key contributions. First, a new DNN structure that takes both dereverberated and original reverberant features as input is shown to improve recognition accuracy over the conventional structure that uses only dereverberated features. Second, in most simulated reverberant environments used for training data collection and DNN-based dereverberation, the resource data and learning targets are high-quality clean speech. With our joint training strategy, we can relax this constraint by using large-scale, diversified real close-talking data as the targets, which are easily collected from mobile internet users via many speech-enabled applications, and we find this scenario even more effective. Our experiments on a Mandarin speech recognition task with 2000 h of training data show that the proposed framework achieves relative word error rate reductions of 9.7% and 8.6% over the multi-condition training systems for the single-channel case and the multi-channel case with beamforming, respectively. Furthermore, significant gains are consistently observed over the pre-processing approach that simply uses DNN-based dereverberation.